import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
# Single accent color shared by the Plotly charts below
colors = ['#235E72']
imdb = pd.read_csv('imdb_movies_india.csv', encoding='latin-1')
imdb.head()
| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | NaN | NaN | Drama | NaN | NaN | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | NaN | NaN | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | NaN | NaN | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |
imdb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Name      15509 non-null  object
 1   Year      14981 non-null  object
 2   Duration  7240 non-null   object
 3   Genre     13632 non-null  object
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object
 6   Director  14984 non-null  object
 7   Actor 1   13892 non-null  object
 8   Actor 2   13125 non-null  object
 9   Actor 3   12365 non-null  object
dtypes: float64(1), object(9)
memory usage: 1.2+ MB
# Checking null values
imdb.isna().sum()
Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64
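# Locating rows where every column after Name (positions 1 through 9) is missing
# iloc slices exclude the end position, so 1: is used to include the last column, Actor 3
nulls = imdb[imdb.iloc[:, 1:].isna().all(axis=1)]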
nulls.head()
| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1836 | Bang Bang Reloaded | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1920 | Battle of bittora | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2653 | Campus | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3403 | Dancing Dad | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3807 | Dial 100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
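The all-missing filter above depends on which columns the `iloc` slice covers. A minimal sketch on a toy frame (the column subset and values are made up for illustration) shows how an end-exclusive slice that stops one column short can flag rows that still carry data:

```python
import numpy as np
import pandas as pd

# Toy frame: row "a" is empty after Name, row "b" still has an Actor 3 value.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Year": [np.nan, np.nan, "(2019)"],
    "Genre": [np.nan, np.nan, "Drama"],
    "Actor 3": [np.nan, "Someone", "Other"],
})

# Slice 1: covers every column after Name.
all_missing = toy[toy.iloc[:, 1:].isna().all(axis=1)]

# Slice 1:3 stops before the last column, so row "b" is flagged as well.
partly_checked = toy[toy.iloc[:, 1:3].isna().all(axis=1)]

print(all_missing["Name"].tolist())
print(partly_checked["Name"].tolist())
```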
# Checking the unique values of each text column for typos
for col in imdb.select_dtypes(include="object"):
    print(f"Name of Column: {col}")
    print(imdb[col].unique())
    print('\n', '-'*60, '\n')
Name of Column: Name [' ' '#Gadhvi (He thought he was Gandhi)' '#Homecoming' ... 'Zulmi Raj' 'Zulmi Shikari' 'Zulm-O-Sitam'] ------------------------------------------------------------ Name of Column: Year [nan '(2019)' '(2021)' '(2010)' '(1997)' '(2005)' '(2008)' '(2012)' '(2014)' '(2004)' '(2016)' '(1991)' '(1990)' '(2018)' '(1987)' '(1948)' '(1958)' '(2017)' '(2020)' '(2009)' '(2002)' '(1993)' '(1946)' '(1994)' '(2007)' '(2013)' '(2003)' '(1998)' '(1979)' '(1951)' '(1956)' '(1974)' '(2015)' '(2006)' '(1981)' '(1985)' '(2011)' '(2001)' '(1967)' '(1988)' '(1995)' '(1959)' '(1996)' '(1970)' '(1976)' '(2000)' '(1999)' '(1973)' '(1968)' '(1943)' '(1953)' '(1986)' '(1983)' '(1989)' '(1982)' '(1977)' '(1957)' '(1950)' '(1992)' '(1969)' '(1975)' '(1947)' '(1972)' '(1971)' '(1935)' '(1978)' '(1960)' '(1944)' '(1963)' '(1940)' '(1984)' '(1934)' '(1955)' '(1936)' '(1980)' '(1966)' '(1949)' '(1962)' '(1964)' '(1952)' '(1933)' '(1942)' '(1939)' '(1954)' '(1945)' '(1961)' '(1965)' '(1938)' '(1941)' '(1931)' '(1937)' '(2022)' '(1932)' '(1923)' '(1915)' '(1928)' '(1922)' '(1917)' '(1913)' '(1930)' '(1926)' '(1914)' '(1924)'] ------------------------------------------------------------ Name of Column: Duration [nan '109 min' '90 min' '110 min' '105 min' '147 min' '142 min' '59 min' '82 min' '116 min' '96 min' '120 min' '161 min' '166 min' '102 min' '87 min' '132 min' '66 min' '146 min' '112 min' '168 min' '158 min' '126 min' '94 min' '138 min' '124 min' '144 min' '157 min' '136 min' '107 min' '113 min' '80 min' '122 min' '149 min' '148 min' '130 min' '121 min' '188 min' '115 min' '103 min' '114 min' '170 min' '100 min' '99 min' '140 min' '128 min' '93 min' '125 min' '145 min' '75 min' '111 min' '134 min' '85 min' '104 min' '92 min' '137 min' '127 min' '150 min' '119 min' '135 min' '86 min' '76 min' '70 min' '72 min' '151 min' '95 min' '52 min' '89 min' '143 min' '177 min' '117 min' '123 min' '154 min' '88 min' '175 min' '153 min' '78 min' '139 min' '133 min' '101 min' '180 min' 
'60 min' '46 min' '164 min' '162 min' '171 min' '160 min' '152 min' '62 min' '163 min' '165 min' '141 min' '210 min' '129 min' '156 min' '240 min' '172 min' '155 min' '118 min' '167 min' '106 min' '193 min' '57 min' '108 min' '45 min' '195 min' '174 min' '81 min' '178 min' '58 min' '184 min' '97 min' '98 min' '131 min' '176 min' '169 min' '77 min' '91 min' '84 min' '173 min' '74 min' '67 min' '181 min' '300 min' '79 min' '65 min' '48 min' '183 min' '159 min' '83 min' '68 min' '49 min' '201 min' '64 min' '186 min' '50 min' '69 min' '207 min' '55 min' '61 min' '185 min' '187 min' '216 min' '63 min' '54 min' '198 min' '51 min' '71 min' '73 min' '218 min' '191 min' '321 min' '199 min' '53 min' '56 min' '179 min' '47 min' '206 min' '190 min' '211 min' '247 min' '213 min' '223 min' '2 min' '189 min' '224 min' '202 min' '255 min' '197 min' '182 min' '214 min' '208 min' '21 min' '200 min' '192 min' '37 min' '261 min' '238 min' '204 min' '235 min' '298 min' '217 min' '250 min'] ------------------------------------------------------------ Name of Column: Genre ['Drama' 'Drama, Musical' 'Comedy, Romance' 'Comedy, Drama, Musical' 'Drama, Romance, War' 'Documentary' 'Horror, Mystery, Thriller' 'Action, Crime, Thriller' 'Horror' 'Horror, Romance, Thriller' 'Comedy, Drama, Romance' 'Thriller' 'Comedy, Drama' nan 'Comedy, Drama, Fantasy' 'Comedy, Drama, Family' 'Crime, Drama, Mystery' 'Horror, Thriller' 'Biography' 'Comedy, Horror' 'Action' 'Drama, Horror, Mystery' 'Comedy' 'Action, Thriller' 'Drama, History' 'Drama, History, Sport' 'Horror, Mystery, Romance' 'Horror, Mystery' 'Drama, Horror, Romance' 'Action, Drama, History' 'Action, Drama, War' 'Comedy, Family' 'Adventure, Horror, Mystery' 'Action, Sci-Fi' 'Crime, Mystery, Thriller' 'War' 'Sport' 'Biography, Drama, History' 'Horror, Romance' 'Crime, Drama' 'Drama, Romance' 'Adventure, Drama' 'Comedy, Mystery, Thriller' 'Action, Crime, Drama' 'Crime, Thriller' 'Horror, Sci-Fi, Thriller' 'Crime, Drama, Thriller' 'Drama, Mystery, 
Thriller' 'Drama, Sport' 'Drama, Family, Musical' 'Action, Comedy' 'Comedy, Thriller' 'Action, Adventure, Fantasy' 'Drama, Romance, Thriller' 'Action, Drama' 'Drama, Horror, Musical' 'Action, Biography, Drama' 'Adventure, Comedy, Drama' 'Mystery' 'Action, Fantasy, Mystery' 'Adventure, Drama, Mystery' 'Mystery, Thriller' 'Adventure' 'Drama, Musical, Thriller' 'Comedy, Crime, Drama' 'Musical, Romance' 'Documentary, Music' 'Documentary, History, Music' 'Drama, Fantasy, Mystery' 'Drama, Family, Sport' 'Drama, Thriller' 'Documentary, Biography' 'Action, Adventure, Comedy' 'Romance' 'Comedy, Drama, Music' 'Comedy, Horror, Mystery' 'Musical' 'Musical, Romance, Drama' 'Family, Romance' 'Action, Sci-Fi, Thriller' 'Action, Drama, Romance' 'Mystery, Romance' 'Fantasy' 'Family' 'Drama, Family' 'Action, Comedy, Drama' 'Action, Drama, Thriller' 'Drama, Horror, Thriller' 'Drama, Musical, Romance' 'Comedy, Sci-Fi' 'Action, Romance' 'Action, Crime' 'Action, Drama, Crime' 'Drama, Family, Music' 'Action, Mystery, Thriller' 'Action, Drama, Family' 'Action, Mystery' 'Drama, History, Romance' 'Crime, Drama, Music' 'Sci-Fi' 'Animation' 'Crime, Mystery, Romance' 'Action, Adventure, Romance' 'Music, Romance' 'Action, Comedy, Crime' 'Comedy, Family, Fantasy' 'Romance, Drama' 'Drama, Family, Romance' 'Romance, Drama, Family' 'Musical, Romance, Thriller' 'Family, Musical, Romance' 'Action, Drama, Fantasy' 'Family, Drama' 'Crime, Drama, Romance' 'Musical, Drama, Romance' 'Drama, Music, Musical' 'Drama, Mystery' 'Adventure, Comedy, Romance' 'Crime, Drama, Horror' 'Family, Music, Musical' 'Action, Musical, Thriller' 'Action, Romance, Thriller' 'Romance, Thriller' 'Drama, Music' 'Crime, Drama, Musical' 'Action, Crime, Mystery' 'Action, Adventure, Thriller' 'Comedy, Romance, Sci-Fi' 'Crime' 'Action, Drama, Mystery' 'Action, Comedy, Thriller' 'Biography, Drama' 'Action, Comedy, Fantasy' 'Drama, Family, Horror' 'Action, Adventure, Family' 'Documentary, Biography, Musical' 'Action, Drama, Musical' 
'Adventure, Thriller' 'Crime, Mystery' 'Drama, Crime' 'Drama, Fantasy, Romance' 'Comedy, Romance, Thriller' 'Musical, Comedy, Drama' 'Biography, History, War' 'Action, Comedy, Romance' 'Drama, History, Musical' 'Action, Crime, Horror' 'Adventure, Fantasy' 'Adventure, Drama, Fantasy' 'Adventure, Fantasy, Romance' 'Action, Adventure, Drama' 'Action, Adventure' 'Comedy, Crime' 'Crime, Drama, Fantasy' 'Adventure, Drama, Romance' 'History' 'Drama, Fantasy, Thriller' 'Musical, Fantasy' 'Documentary, Thriller' 'Mystery, Romance, Musical' 'Family, Drama, Romance' 'History, Musical, Romance' 'Musical, Drama, Crime' 'Adventure, Crime, Romance' 'Musical, Thriller, Mystery' 'Drama, Comedy' 'Biography, Drama, Romance' 'Biography, Music' 'Biography, Drama, Music' 'Drama, Sci-Fi' 'Drama, Family, Thriller' 'Comedy, Musical, Romance' 'Drama, Family, Comedy' 'Action, Thriller, Romance' 'Animation, Adventure' 'Action, Crime, Musical' 'Action, Crime, Romance' 'Animation, Action, Adventure' 'Action, Drama, Sport' 'Comedy, History' 'Documentary, History' 'Drama, Comedy, Family' 'Action, Adventure, Crime' 'Documentary, Biography, Music' 'Comedy, Musical' 'Biography, Crime, Thriller' 'Adventure, Mystery, Thriller' 'Biography, Drama, Sport' 'Action, Comedy, Musical' 'Mystery, Romance, Thriller' 'Action, Adventure, Musical' 'Crime, Musical, Mystery' 'Action, Thriller, Crime' 'Adventure, Comedy, Crime' 'Comedy, Horror, Musical' 'Adventure, Family' 'Family, Thriller' 'Drama, Action, Crime' 'Drama, War' 'Action, Drama, Adventure' 'Adventure, Fantasy, History' 'Fantasy, Musical' 'Comedy, Drama, Thriller' 'Drama, Fantasy' 'Musical, Drama' 'Action, Drama, Horror' 'Biography, Crime, Drama' 'Action, Drama, Music' 'Adventure, Drama, Family' 'Drama, Romance, Musical' 'Comedy, Musical, Drama' 'Adventure, Comedy, Musical' 'Crime, Drama, Family' 'Thriller, Musical, Mystery' 'Documentary, Adventure, Crime' 'Drama, Action, Horror' 'Adventure, Crime, Drama' 'Documentary, Biography, Sport' 'Crime, Fantasy, 
Mystery' 'Documentary, Biography, Drama' 'Action, Fantasy, Thriller' 'Adventure, Drama, History' 'Animation, Drama, History' 'Comedy, Horror, Thriller' 'Drama, Family, History' 'Animation, History' 'Biography, Drama, Musical' 'Music' 'Family, Comedy' 'Adventure, Mystery' 'Family, Fantasy' 'Documentary, History, News' 'Drama, Mystery, Romance' 'Comedy, Fantasy' 'Action, Crime, Family' 'Drama, Musical, Mystery' 'Action, Thriller, Mystery' 'Drama, Family, Fantasy' 'Action, Family' 'Action, Adventure, Mystery' 'Horror, Fantasy' 'Comedy, Action' 'Adventure, Romance' 'Drama, Adventure' 'Animation, Drama, Romance' 'Comedy, Crime, Romance' 'Adventure, Comedy' 'Comedy, Drama, Sport' 'Documentary, Crime, History' 'Musical, Mystery, Drama' 'Adventure, Drama, Sci-Fi' 'Action, Romance, Western' 'Comedy, Fantasy, Romance' 'Animation, Action, Comedy' 'Drama, Fantasy, Sci-Fi' 'Drama, Horror' 'Family, Drama, Comedy' 'Action, Adventure, History' 'Comedy, Family, Romance' 'Biography, History' 'Animation, Family' 'Drama, Fantasy, History' 'Animation, Adventure, Fantasy' 'Adventure, Comedy, Family' 'Drama, History, War' 'Animation, Drama, Fantasy' 'Action, Musical, Romance' 'Crime, Action, Drama' 'Comedy, Romance, Musical' 'Fantasy, Drama' 'Musical, Action, Crime' 'Documentary, Drama' 'Action, Horror, Thriller' 'Action, Horror, Sci-Fi' 'Mystery, Sci-Fi, Thriller' 'Biography, Family' 'Drama, Action, Comedy' 'Drama, Music, Romance' 'Action, Biography, Crime' 'Adventure, Drama, Musical' 'Family, Music, Romance' 'Fantasy, Mystery, Romance' 'Drama, Crime, Family' 'Drama, Family, Action' 'Romance, Comedy, Drama' 'Animation, Adventure, Comedy' 'Sci-Fi, Thriller' 'Romance, Family, Drama' 'Action, Family, Thriller' 'Adventure, Crime, Thriller' 'Drama, Romance, Sport' 'Comedy, Crime, Mystery' 'Adventure, Comedy, Mystery' 'Action, Fantasy' 'Comedy, Mystery' 'Animation, Adventure, Family' 'Adventure, Drama, Music' 'Biography, Drama, War' 'Documentary, Comedy, Drama' 'Musical, Drama, Family' 
'Animation, Comedy, Drama' 'Fantasy, Musical, Drama' 'Adventure, Crime, Mystery' 'Comedy, Drama, Mystery' 'Documentary, News' 'Drama, Musical, Family' 'Action, Romance, Drama' 'Comedy, Crime, Thriller' 'Action, Musical' 'Action, History' 'Action, Comedy, Mystery' 'Drama, Family, Mystery' 'Adventure, Drama, Thriller' 'Documentary, Reality-TV' 'Action, Fantasy, Horror' 'Drama, History, Thriller' 'Documentary, Family' 'Documentary, Biography, Family' 'Comedy, Sport' 'Animation, Comedy, Family' 'Crime, Romance, Thriller' 'Comedy, Musical, Action' 'Action, Mystery, Sci-Fi' 'Comedy, Crime, Musical' 'Drama, Adventure, Action' 'History, Romance' 'Reality-TV' 'Fantasy, History' 'Family, Drama, Thriller' 'Musical, Mystery, Thriller' 'Musical, Comedy, Romance' 'Musical, Action, Drama' 'Action, Musical, War' 'Romance, Comedy' 'Horror, Crime, Thriller' 'Crime, Drama, History' 'Comedy, Drama, Horror' 'Crime, Horror, Thriller' 'Animation, Comedy' 'Romance, Action, Crime' 'Musical, Thriller' 'Action, Romance, Comedy' 'Comedy, Family, Musical' 'Horror, Drama, Mystery' 'Thriller, Mystery, Family' 'Comedy, Drama, Sci-Fi' 'Documentary, Adventure' 'Documentary, Biography, Crime' 'Musical, Action' 'Musical, Mystery' 'Action, Crime, Sci-Fi' 'Action, Horror, Mystery' 'Fantasy, Horror' 'Adventure, Family, Fantasy' 'Fantasy, Sci-Fi' 'Comedy, War' 'Romance, Action, Drama' 'Musical, Family, Romance' 'Romance, Drama, Action' 'Family, Comedy, Drama' 'Comedy, Music, Romance' 'Comedy, Family, Sci-Fi' 'Action, Drama, Western' 'Adventure, Romance, Thriller' 'Biography, Comedy, Drama' 'Action, Mystery, Romance' 'Romance, Sport' 'Crime, Romance' 'Action, Thriller, Western' 'Crime, Musical, Romance' 'Romance, Thriller, Mystery' 'Drama, Crime, Mystery' 'Biography, Drama, Family' 'Action, Family, Mystery' 'Comedy, Mystery, Romance' 'Drama, Thriller, Action' 'Documentary, Short' 'Documentary, Western' 'Musical, Family, Drama' 'Action, Family, Musical' 'Animation, Family, Musical' 'Drama, Fantasy, Horror' 
'Action, Adventure, Sci-Fi' 'Drama, Action, Musical' 'Drama, Musical, Sport' 'Action, Comedy, Horror' 'Drama, Fantasy, Musical' 'Action, Fantasy, Musical' 'Animation, Action' 'Comedy, Music' 'Documentary, Drama, Romance' 'Drama, Music, Thriller' 'Fantasy, Musical, Mystery' 'Drama, Fantasy, War' 'Action, War' 'Action, Adventure, War' 'Horror, Musical' 'Fantasy, Mystery, Thriller' 'Adventure, Biography, Drama' 'Family, Romance, Sci-Fi' 'Drama, Romance, Family' 'Animation, Adventure, Drama' 'Family, Romance, Drama' 'Animation, Action, Sci-Fi' 'Adventure, Comedy, Fantasy' 'Comedy, Crime, Family' 'Horror, Musical, Thriller' 'Biography, Drama, Thriller' 'Drama, Western' 'Romance, Sci-Fi, Thriller' 'Comedy, Musical, Family' 'Comedy, Horror, Romance' 'Thriller, Action' 'Fantasy, Thriller, Action' 'Fantasy, Romance' 'Action, Drama, Comedy' 'Family, Fantasy, Romance' 'Comedy, Crime, Horror' 'Horror, Mystery, Sci-Fi' 'Animation, Action, Drama' 'Family, Mystery' 'Adventure, Biography, History' 'Fantasy, Horror, Mystery' 'Family, Musical' 'Drama, Family, Adventure' 'Crime, Horror, Mystery' 'Documentary, Drama, Fantasy' 'Action, Adventure, Biography' 'Biography, History, Thriller' 'Action, Family, Drama' 'Documentary, Drama, Sport' 'Thriller, Mystery' 'Musical, Drama, Comedy' 'Documentary, History, War' 'Adventure, Horror, Thriller' 'Action, Adventure, Horror' 'Action, Crime, War' 'Adventure, Musical, Romance' 'Action, Fantasy, Sci-Fi' 'Drama, Comedy, Action' 'Documentary, Sport' 'Documentary, Adventure, Music' 'Drama, Action, Family' 'Adventure, History, Thriller' 'Adventure, Horror, Romance' 'Adventure, Crime, Horror' 'Mystery, Musical, Romance' 'Action, Crime, History' 'Documentary, Musical' 'Adventure, Fantasy, Musical' 'Documentary, Family, History' 'Documentary, Drama, Family' 'Drama, Mystery, Sci-Fi' 'Animation, Drama, Musical' 'Drama, History, Mystery' 'Drama, Sport, Thriller' 'Action, Crime, Fantasy' 'Comedy, Musical, Mystery' 'Romance, Musical, Action' 'Musical, Drama, 
Fantasy' 'Animation, Family, History' 'Action, Drama, News' 'Romance, Musical, Comedy' 'Adventure, Fantasy, Horror' 'Adventure, History' 'Comedy, Drama, History' 'Mystery, Sci-Fi' 'Action, Thriller, War' 'Documentary, Drama, News' 'Documentary, Crime, Mystery' 'Adventure, Horror' 'Animation, Drama, Adventure' 'Crime, Horror, Romance' 'Documentary, Adventure, Drama' 'Documentary, Biography, History' 'Fantasy, Horror, Romance' 'Comedy, Fantasy, Musical' 'Crime, Musical, Thriller' 'Documentary, War' 'Action, Comedy, War' 'Crime, Drama, Sport' 'Musical, Adventure, Drama' 'Horror, Romance, Sci-Fi' 'Musical, Mystery, Romance' 'Romance, Musical, Drama' 'Adventure, Fantasy, Sci-Fi'] ------------------------------------------------------------ Name of Column: Votes [nan '8' '35' ... '70,344' '408' '1,496'] ------------------------------------------------------------ Name of Column: Director ['J.S. Randhawa' 'Gaurav Bakshi' 'Soumyajit Majumdar' ... 'Mozez Singh' 'Ved Prakash' 'Kiran Thej'] ------------------------------------------------------------ Name of Column: Actor 1 ['Manmauji' 'Rasika Dugal' 'Sayani Gupta' ... 'Meghan Jadhav' 'Roohi Berde' 'Sangeeta Tiwari'] ------------------------------------------------------------ Name of Column: Actor 2 ['Birbal' 'Vivek Ghamande' 'Plabita Borthakur' ... 'Devan Sanjeev' 'Prince Daniel' 'Sarah Jane Dias'] ------------------------------------------------------------ Name of Column: Actor 3 ['Rajendra Bhatia' 'Arvind Jangid' 'Roy Angana' ... 'Shatakshi Gupta' 'Valerie Agha' 'Suparna Anand'] ------------------------------------------------------------
# Handling the null values
imdb.dropna(subset=['Name', 'Year', 'Duration', 'Rating', 'Votes', 'Director', 'Actor 1', 'Actor 2', 'Actor 3'], inplace=True)
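# Extracting the first run of letters, spaces, apostrophes and hyphens from the Name column
# (drops a leading '#' and anything after the first other character, such as a bracketed subtitle)
imdb['Name'] = imdb['Name'].str.extract(r"([A-Za-z\s'\-]+)", expand=False)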
# Replacing the brackets from year column as observed above
imdb['Year'] = imdb['Year'].str.replace(r'[()]', '', regex=True).astype(int)
# Convert 'Duration' to numeric and replacing the min, while keeping only numerical part
imdb['Duration'] = pd.to_numeric(imdb['Duration'].str.replace(r' min', '', regex=True), errors='coerce')
# Splitting Genre on ', ' and exploding to one row per (movie, genre); missing genres are filled with the mode
imdb['Genre'] = imdb['Genre'].str.split(', ')
imdb = imdb.explode('Genre')
imdb['Genre'] = imdb['Genre'].fillna(imdb['Genre'].mode()[0])
# Convert 'Votes' to numeric and replace the , to keep only numerical part
imdb['Votes'] = pd.to_numeric(imdb['Votes'].str.replace(',', ''), errors='coerce')
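The string-to-number conversions above can be checked on a tiny hand-made sample before running them on the full dataset. This is only a sketch: the sample frame below is hypothetical and simply mimics the raw formats seen in the unique-value listing.

```python
import pandas as pd

# Hypothetical sample in the same raw string format as the dataset.
sample = pd.DataFrame({
    "Year": ["(2019)", "(1997)"],
    "Duration": ["109 min", "147 min"],
    "Votes": ["8", "1,496"],
})

# Strip the parentheses, then cast to int.
sample["Year"] = sample["Year"].str.replace(r"[()]", "", regex=True).astype(int)

# Drop the ' min' suffix; errors='coerce' turns anything unparseable into NaN.
sample["Duration"] = pd.to_numeric(
    sample["Duration"].str.replace(" min", "", regex=False), errors="coerce"
)

# Remove thousands separators before converting.
sample["Votes"] = pd.to_numeric(
    sample["Votes"].str.replace(",", "", regex=False), errors="coerce"
)

print(sample.dtypes)
print(sample)
```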
#checking duplicate values by Name and Year
duplicate = imdb.groupby(['Name', 'Year']).filter(lambda x: len(x) > 1)
duplicate.head(5)
| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Yaaram | 2019 | 110 | Comedy | 4.4 | 35 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 3 | Yaaram | 2019 | 110 | Romance | 4.4 | 35 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 5 | Aur Pyaar Ho Gaya | 1997 | 147 | Comedy | 4.7 | 827 | Rahul Rawail | Bobby Deol | Aishwarya Rai Bachchan | Shammi Kapoor |
| 5 | Aur Pyaar Ho Gaya | 1997 | 147 | Drama | 4.7 | 827 | Rahul Rawail | Bobby Deol | Aishwarya Rai Bachchan | Shammi Kapoor |
| 5 | Aur Pyaar Ho Gaya | 1997 | 147 | Musical | 4.7 | 827 | Rahul Rawail | Bobby Deol | Aishwarya Rai Bachchan | Shammi Kapoor |
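# Dropping every row whose Name appears more than once (keep=False keeps none of the copies).
# Because explode() created one row per genre, this removes all multi-genre movies
# and leaves only titles that have a single genre and a unique name.
imdb = imdb.drop_duplicates(subset=['Name'], keep=False)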
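To see how `explode` interacts with `drop_duplicates`, here is a small sketch on a made-up three-movie frame. The `keep="first"` variant is shown only as one possible alternative, not as what this notebook does.

```python
import pandas as pd

# Made-up movies with one, two and three genres.
movies = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Genre": ["Drama", "Comedy, Romance", "Action, Drama, Thriller"],
})

# One row per (movie, genre) pair.
exploded = movies.assign(Genre=movies["Genre"].str.split(", ")).explode("Genre")

# keep=False removes every row whose Name occurs more than once,
# so only the single-genre movie survives.
strict = exploded.drop_duplicates(subset=["Name"], keep=False)

# Alternative (an assumption for comparison): keep the first listed genre per movie.
first_genre = exploded.drop_duplicates(subset=["Name"], keep="first")

print(len(exploded), len(strict), len(first_genre))
```

Which variant is appropriate depends on whether multi-genre movies should stay in the analysis.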
imdb.describe()
| | Year | Duration | Rating | Votes |
|---|---|---|---|---|
| count | 1528.000000 | 1528.000000 | 1528.000000 | 1528.000000 |
| mean | 1997.972513 | 123.823953 | 5.976243 | 552.479712 |
| std | 21.181921 | 25.108144 | 1.412547 | 4311.631841 |
| min | 1931.000000 | 45.000000 | 1.600000 | 5.000000 |
| 25% | 1985.000000 | 107.000000 | 5.100000 | 14.000000 |
| 50% | 2004.000000 | 126.000000 | 6.100000 | 34.000000 |
| 75% | 2016.000000 | 140.000000 | 7.000000 | 127.250000 |
| max | 2021.000000 | 300.000000 | 9.400000 | 101014.000000 |
imdb.describe(include = 'O')
| | Name | Genre | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|
| count | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 |
| unique | 1528 | 20 | 1114 | 1010 | 1131 | 1154 |
| top | Gadhvi | Drama | Kanti Shah | Mithun Chakraborty | Mithun Chakraborty | Pran |
| freq | 1 | 789 | 13 | 22 | 12 | 16 |
# Find the row with the highest number of votes
max_votes_row = imdb[imdb['Votes'] == imdb['Votes'].max()]
# Get the name of the movie with the highest votes
movie_highest_votes = max_votes_row['Name'].values[0]
# Find the number of votes for the movie with the highest votes
votes_highest_votes = max_votes_row['Votes'].values[0]
print("Movie with the highest votes:", movie_highest_votes)
print("Number of votes for the movie with the highest votes:", votes_highest_votes)
print('\n', '='*100, '\n')
# Find the row with the lowest number of votes
min_votes_row = imdb[imdb['Votes'] == imdb['Votes'].min()]
# Get the name of the movie with the lowest votes
movie_lowest_votes = min_votes_row['Name'].values[0]
# Find the number of votes for the movie with the lowest votes
votes_lowest_votes = min_votes_row['Votes'].values[0]
print("Movie with the lowest votes:", movie_lowest_votes)
print("Number of votes for the movie with the lowest votes:", votes_lowest_votes)
Movie with the highest votes: My Name Is Khan
Number of votes for the movie with the highest votes: 101014

 ==================================================================================================== 

Movie with the lowest votes: Anmol Sitaare
Number of votes for the movie with the lowest votes: 5
# Find the row with the highest rating
max_rating_row = imdb[imdb['Rating'] == imdb['Rating'].max()]
movie_highest_rating = max_rating_row['Name'].values[0]
votes_highest_rating = max_rating_row['Votes'].values[0]
print("Movie with the highest rating:", movie_highest_rating)
print("Number of votes for the movie with the highest rating:", votes_highest_rating)
print('\n', '='*100, '\n')
# Find the row with the lowest rating
min_rating_row = imdb[imdb['Rating'] == imdb['Rating'].min()]
movie_lowest_rating = min_rating_row['Name'].values[0]
votes_lowest_rating = min_rating_row['Votes'].values[0]
print("Movie with the lowest rating:", movie_lowest_rating)
print("Number of votes for the movie with the lowest rating:", votes_lowest_rating)
Movie with the highest rating: June
Number of votes for the movie with the highest rating: 18

 ==================================================================================================== 

Movie with the lowest rating: Mumbai Can Dance Saalaa
Number of votes for the movie with the lowest rating: 43
# Group the dataset by the 'Director' column and count the number of movies each director has directed
director_counts = imdb['Director'].value_counts()
# Find the director with the highest number of movies directed
most_prolific_director = director_counts.idxmax()
num_movies_directed = director_counts.max()
print("Director with the most movies directed:", most_prolific_director)
print("Number of movies directed by", most_prolific_director, ":", num_movies_directed)
print('\n', '='*100, '\n')
# Find the director with the fewest movies directed (idxmin returns the first of any ties)
least_prolific_director = director_counts.idxmin()
num_movies_directed = director_counts.min()
print("Director with the fewest movies directed:", least_prolific_director)
print("Number of movies directed by", least_prolific_director, ":", num_movies_directed)
Director with the most movies directed: Kanti Shah
Number of movies directed by Kanti Shah : 13

 ==================================================================================================== 

Director with the fewest movies directed: Sikandar Khanna
Number of movies directed by Sikandar Khanna : 1
fig_year = px.histogram(imdb, x = 'Year', histnorm='probability density', nbins = 30, color_discrete_sequence = colors)
fig_year.update_traces(selector=dict(type='histogram'))
fig_year.update_layout(title='Distribution of Year', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Year', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_year.show()
fig_duration = px.histogram(imdb, x = 'Duration', histnorm='probability density', nbins = 40, color_discrete_sequence = colors)
fig_duration.update_traces(selector=dict(type='histogram'))
fig_duration.update_layout(title='Distribution of Duration', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Duration', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_duration.show()
fig_rating = px.histogram(imdb, x = 'Rating', histnorm='probability density', nbins = 40, color_discrete_sequence = colors)
fig_rating.update_traces(selector=dict(type='histogram'))
fig_rating.update_layout(title='Distribution of Rating', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Rating', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_rating.show()
fig_votes = px.box(imdb, x = 'Votes', color_discrete_sequence = colors)
fig_votes.update_layout(title='Distribution of Votes', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Votes', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_votes.show()
The distribution of Year is left-skewed: most movies in the cleaned data are recent, with a high concentration released between 2015 and 2019.
The duration of movies is roughly Gaussian, with only a few outliers.
The distribution of Rating is also roughly Gaussian, with a high concentration around 6.6 and 6.7.
The number of votes has many outliers: the median is 34 votes, while the maximum is 101,014.
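The remark about outliers can be made concrete with the usual 1.5 × IQR box-plot rule. The sketch below uses a short made-up list of vote counts (only the smallest and largest values echo the summary table above), so the numbers are illustrative rather than results from the dataset.

```python
import pandas as pd

# Illustrative vote counts with one extreme value.
votes = pd.Series([5, 8, 14, 20, 34, 60, 127, 400, 101014])

# Interquartile range and the 1.5 * IQR fences.
q1, q3 = votes.quantile(0.25), votes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are treated as outliers.
outliers = votes[(votes < lower) | (votes > upper)]

print("Fences:", lower, upper)
print("Outliers:", outliers.tolist())
print("Skewness:", votes.skew())  # a positive value indicates a long right tail
```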
year_avg_rating = imdb.groupby('Year')['Rating'].mean().reset_index()
top_10_years = year_avg_rating.nlargest(10, 'Rating')
fig = px.bar(top_10_years, x='Year', y='Rating', title='Top 10 Years by Average Rating', color = "Rating", color_continuous_scale = "darkmint")
fig.update_xaxes(type='category')
fig.update_layout(xaxis_title='Year', yaxis_title='Average Rating', plot_bgcolor = 'white')
fig.show()
# Group data by Year and calculate the average rating
average_rating_by_year = imdb.groupby('Year')['Rating'].mean().reset_index()
# Create the line plot with Plotly Express
fig = px.line(average_rating_by_year, x='Year', y='Rating', color_discrete_sequence=['#559C9E'])
fig.update_layout(title='Are there any trends in ratings across years?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Year', yaxis_title='Rating', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig.show()
# Group data by Year and calculate the average number of votes
average_votes_by_year = imdb.groupby('Year')['Votes'].mean().reset_index()
# Create the line plot with Plotly Express
fig = px.line(average_votes_by_year, x='Year', y='Votes', color_discrete_sequence=['#559C9E'])
fig.update_layout(title='Are there any trends in votes across years?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Year', yaxis_title='Votes', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig.show()
# Group data by Year and Genre and calculate the average rating
average_rating_by_year = imdb.groupby(['Year', 'Genre'])['Rating'].mean().reset_index()
# Get the top 3 genres
top_3_genres = imdb['Genre'].value_counts().head(3).index
# Filter the data to include only the top 3 genres
average_rating_by_year = average_rating_by_year[average_rating_by_year['Genre'].isin(top_3_genres)]
# Create the line plot with Plotly Express
fig = px.line(average_rating_by_year, x='Year', y='Rating', color = "Genre", color_discrete_sequence=['#559C9E', '#0B1F26', '#00CC96'])
# Customize the layout
fig.update_layout(title='Average Rating by Year for Top 3 Genres', xaxis_title='Year', yaxis_title='Average Rating', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor='white')
# Show the plot
fig.show()
# trendline='ols' fits an ordinary least squares line and requires the statsmodels package
fig_dur_rat = px.scatter(imdb, x = 'Duration', y = 'Rating', trendline='ols', color = "Rating", color_continuous_scale = "darkmint")
fig_dur_rat.update_layout(title='Does length of movie have any impact on rating?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Duration of Movie in Minutes', yaxis_title='Rating of a movie', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_dur_rat.show()
fig_dur_votes = px.scatter(imdb, x = 'Duration', y = 'Votes', trendline='ols', color = "Votes", color_continuous_scale = "darkmint")
fig_dur_votes.update_layout(title='Does length of movie have any impact on Votes?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Duration of Movie in Minutes', yaxis_title='Votes of a movie', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_dur_votes.show()
fig_rat_votes = px.scatter(imdb, x = 'Rating', y = 'Votes', trendline='ols', color = "Votes", color_continuous_scale = "darkmint")
fig_rat_votes.update_layout(title='Does the rating of a movie have any impact on votes?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Ratings of Movies', yaxis_title='Votes of movies', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_rat_votes.show()
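The scatter plots answer these questions visually; a correlation coefficient gives a single number to go with them. A minimal sketch on made-up duration and rating values (not taken from the dataset) computes Pearson's coefficient directly and Spearman's by correlating the ranks:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Duration": [90, 100, 110, 120, 130],
    "Rating": [5.0, 6.2, 5.8, 6.9, 7.1],
})

# Pearson: strength of the linear relationship.
pearson = toy["Duration"].corr(toy["Rating"])

# Spearman: Pearson correlation of the ranks, which captures monotonic relationships.
spearman = toy["Duration"].rank().corr(toy["Rating"].rank())

print("Pearson:", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

For a heavily skewed column such as Votes, the rank-based coefficient is usually the less outlier-sensitive of the two.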
# Dropping non essential columns
imdb.drop('Name', axis = 1, inplace = True)
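# Mean-encoding each categorical column: every row gets the average Rating of its genre, director or actor
# (the averages are computed over the full dataset, before the train/test split)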
genre_mean_rating = imdb.groupby('Genre')['Rating'].transform('mean')
imdb['Genre_mean_rating'] = genre_mean_rating
director_mean_rating = imdb.groupby('Director')['Rating'].transform('mean')
imdb['Director_encoded'] = director_mean_rating
actor1_mean_rating = imdb.groupby('Actor 1')['Rating'].transform('mean')
imdb['Actor1_encoded'] = actor1_mean_rating
actor2_mean_rating = imdb.groupby('Actor 2')['Rating'].transform('mean')
imdb['Actor2_encoded'] = actor2_mean_rating
actor3_mean_rating = imdb.groupby('Actor 3')['Rating'].transform('mean')
imdb['Actor3_encoded'] = actor3_mean_rating
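Since the encodings above are averaged over every row, each row's encoded value includes its own rating, and for a director or actor with a single movie it equals that rating. This can leak the target into the features and may make test scores look better than they would on unseen movies. A minimal sketch of a leakage-aware variant, assuming a made-up frame and a hypothetical fixed split, learns the averages from the training rows only and falls back to the training mean for unseen categories:

```python
import pandas as pd

# Made-up data; the last two rows play the role of the test set.
df = pd.DataFrame({
    "Director": ["X", "X", "Y", "Y", "X", "Z"],
    "Rating":   [6.0, 8.0, 5.0, 7.0, 9.0, 4.0],
})
train = df.iloc[:4].copy()
test = df.iloc[4:].copy()

# Learn the encoding from the training rows only.
global_mean = train["Rating"].mean()
director_means = train.groupby("Director")["Rating"].mean()

# Apply it to both splits; unseen directors get the training mean.
train["Director_encoded"] = train["Director"].map(director_means)
test["Director_encoded"] = test["Director"].map(director_means).fillna(global_mean)

print(test)
```

The same pattern would apply to the genre and actor columns. Within the training split a row still contributes to its own average, which out-of-fold encoding can reduce further.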
# Keeping the predictor and target variable
X = imdb[[ 'Year', 'Votes', 'Duration', 'Genre_mean_rating','Director_encoded','Actor1_encoded', 'Actor2_encoded', 'Actor3_encoded']]
y = imdb['Rating']
# Splitting the dataset into training and testing parts
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
# Building 2 machine learning models and training them
lr = LinearRegression()
lr.fit(X_train,y_train)
lr_pred = lr.predict(X_test)
rf = RandomForestRegressor()
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
# Evaluating the performance of the trained models
print('The performance evaluation of Linear Regression is below: ', '\n')
print('Mean squared error: ',mean_squared_error(y_test, lr_pred))
print('Mean absolute error: ',mean_absolute_error(y_test, lr_pred))
print('R2 score: ',r2_score(y_test, lr_pred))
print('\n', '='*100, '\n')
print('The performance evaluation of Random Forest Regressor is below: ', '\n')
print('Mean squared error: ',mean_squared_error(y_test, rf_pred))
print('Mean absolute error: ',mean_absolute_error(y_test, rf_pred))
print('R2 score: ',r2_score(y_test, rf_pred))
The performance evaluation of Linear Regression is below:

Mean squared error: 0.13007622782536266
Mean absolute error: 0.2507994097724827
R2 score: 0.935188545523222

====================================================================================================

The performance evaluation of Random Forest Regressor is below:

Mean squared error: 0.11293861764705879
Mean absolute error: 0.18987908496732034
R2 score: 0.9437274881146624
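The scores above come from a single 80/20 split. `cross_val_score` is imported at the top of the notebook but not used, so a hedged sketch of how it could give a more stable comparison follows. The synthetic `X_demo` and `y_demo` are assumptions standing in for the notebook's `X` and `y`, and `n_estimators=50` is chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the real IMDb features)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'Votes': rng.integers(5, 1000, size=200),
    'Duration': rng.integers(60, 180, size=200),
})
y_demo = 5 + 0.01 * X_demo['Duration'] + rng.normal(0, 0.5, size=200)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated R2 for each model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
    cv_results[name] = scores
    print(f'{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})')
```

Reporting the mean and spread across folds shows whether a gap as small as 0.94 versus 0.935 is larger than the fold-to-fold variation.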
# Checking a sample of the predictor values on which the model is trained
X.head()
| | Year | Votes | Duration | Genre_mean_rating | Director_encoded | Actor1_encoded | Actor2_encoded | Actor3_encoded |
|---|---|---|---|---|---|---|---|---|
| 1 | 2019 | 8 | 109 | 6.420152 | 7.000 | 6.850000 | 7.000000 | 7.000 |
| 10 | 2004 | 17 | 96 | 6.420152 | 6.200 | 5.766667 | 5.100000 | 6.200 |
| 11 | 2016 | 59 | 120 | 4.698529 | 5.900 | 5.900000 | 5.900000 | 5.900 |
| 30 | 2005 | 1002 | 116 | 6.420152 | 6.525 | 6.900000 | 6.866667 | 5.700 |
| 32 | 1993 | 15 | 168 | 6.420152 | 5.400 | 5.600000 | 6.400000 | 5.825 |
# Checking the ratings that correspond to the predictor rows above
y.head()
1     7.0
10    6.2
11    5.9
30    7.1
32    5.6
Name: Rating, dtype: float64
# Creating a new dataframe with values close to the 3rd row according to the sample above
data = {'Year': [2016], 'Votes': [58], 'Duration': [121], 'Genre_mean_rating': [4.5], 'Director_encoded': [5.8], 'Actor1_encoded': [5.9], 'Actor2_encoded': [5.9], 'Actor3_encoded': [5.900]}
df = pd.DataFrame(data)
# Predict the movie rating
predicted_rating = rf.predict(df)
# Display the predicted rating
print("Predicted Rating:", predicted_rating[0])
Predicted Rating: 5.849999999999995
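The prediction above uses hand-typed encoded values. For a genuinely new movie, the encoded features would have to be looked up from the group means learned earlier. The following is a minimal sketch of that lookup, under stated assumptions: the lookup tables, the names in them, the `fallback_rating`, and the reduced five-column `feature_order` are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical lookup tables; in the notebook these would come from
# imdb.groupby(<column>)['Rating'].mean() (values here are illustrative only)
director_means = pd.Series({'Director X': 5.8, 'Director Y': 6.4})
actor_means = pd.Series({'Actor P': 5.9, 'Actor Q': 6.7})
fallback_rating = 5.9  # assumed overall mean rating for unseen names

def encode(lookup, name):
    """Return the learned mean rating for `name`, or the fallback if it is unseen."""
    return float(lookup.get(name, fallback_rating))

# Columns must appear in the same order the model was trained on
feature_order = ['Year', 'Votes', 'Duration', 'Director_encoded', 'Actor1_encoded']

new_movie = pd.DataFrame([{
    'Year': 2016,
    'Votes': 58,
    'Duration': 121,
    'Director_encoded': encode(director_means, 'Director X'),
    'Actor1_encoded': encode(actor_means, 'Someone Unseen'),
}])[feature_order]

print(new_movie)
```

A frame built this way, extended to all eight training columns, could then be passed to `rf.predict` in the same manner as `df` above.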
In this notebook we analyzed an IMDb dataset of Indian movies and built models to predict movie ratings. The raw data had several problems, including missing values, typos in column names, and duplicated records, which we resolved through a series of cleaning and preprocessing steps before the analysis. The exploration surfaced trends in movie durations, genre popularity, the most prolific actors and directors, and the distribution of ratings and votes over the years. Notably, short-duration movies tend to receive higher ratings and votes, and the Drama genre has consistently performed well in terms of ratings. In the modeling step, Random Forest outperformed Linear Regression on the held-out test set, with an R-squared score of 0.94 against 0.935 and a mean absolute error of 0.19 against 0.25.
Insights:
The main insights are as follows:
Data cleaning was essential, involving the correction of typos and handling missing/duplicated values.
We explored the temporal dimension of the data, noting that the first entry dates from 1931 and that one movie has a duration of just 45 minutes.
Mithun is the most frequently appearing lead actor.
We identified both the best and worst-performing movies in terms of votes and ratings.
Insights on directors with the most and least movies were gained.
The distribution of movies over the years is skewed, with a concentration in the 2015-2019 period.
Movies released in 2010 had the highest average number of votes.
Short-duration movies tend to receive higher ratings and votes, indicating a potential preference for shorter films.
Drama is a consistently popular genre, while the Comedy and Action genres first appear in the dataset in 1953 and 1964, respectively.
The distribution of ratings and votes follows Gaussian-like patterns, with specific peaks and trends over time.
Random Forest regression outperformed Linear Regression on the test set, with an R-squared score of 0.94 against 0.935.
Through our analysis, we gained a deep understanding of the dataset and its trends. This knowledge can be leveraged to make informed decisions regarding movie production, genres, and more. Future work could involve building more advanced machine learning models or diving deeper into specific genres or time periods to uncover additional insights.
What's next?
To further enhance this project, one can consider the following:
Explore additional machine learning models and fine-tune their parameters to improve rating prediction accuracy.
Investigate the relationships between various features, such as the impact of specific actors, directors, or genres on movie ratings.
Conduct sentiment analysis on movie reviews or incorporate external data sources to gain more insights into factors influencing movie ratings.
Create visualizations and dashboards to make the insights more accessible and engaging for stakeholders.
Continue updating the dataset with new movie releases and ratings for ongoing analysis and predictions.
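As a starting point for the first two suggestions, the sketch below tunes a Random Forest with `GridSearchCV` and then inspects its feature importances. The synthetic data, the small parameter grid, and the choice of mean absolute error as the scoring metric are assumptions for illustration; a real search on the notebook's `X` and `y` would likely use a wider grid.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the real IMDb features); the target
# depends only on Director_encoded plus noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'Votes': rng.integers(5, 1000, size=150),
    'Duration': rng.integers(60, 180, size=150),
    'Director_encoded': rng.uniform(3, 9, size=150),
})
y_demo = 0.8 * X_demo['Director_encoded'] + rng.normal(0, 0.3, size=150)

# Small illustrative grid; every combination is scored with 3-fold CV
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_mean_absolute_error',
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best cross-validated MAE:', -search.best_score_)

# Impurity-based importances of the refitted best model, largest first
importances = pd.Series(
    search.best_estimator_.feature_importances_,
    index=X_demo.columns,
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances only rank features by how much the forest relied on them. They do not establish that a director or actor causes higher ratings.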